Cluster-based similarity aggregation for ontology matching

Authors

  • Quang-Vinh Tran
  • Ryutaro Ichise
  • Bao-Quoc Ho
Abstract

CSA is an automatic similarity aggregation system for ontology matching. The system has two main parts. The first part is the calculation and combination of different similarity measures; the second is alignment extraction. The system first calculates five basic measures to create five similarity matrices, e.g., a string-based similarity measure and a WordNet-based similarity measure. Next, it exploits the advantage of each measure through a weight estimation process, and the similarity matrices are combined into a final similarity matrix. A pre-alignment is then extracted from this matrix. Finally, a pruning process is applied to increase the accuracy of the system.

1 Presentation of the system

Ontologies are widely used to provide semantics to data in the new internet environment. Since they are created by different users for different purposes, it is necessary to develop a method to match multiple ontologies in order to integrate data from different resources [2].

1.1 State, purpose, general statement

CSA (Cluster-based Similarity Aggregation) is an automatic weight aggregation system for ontology alignment. The system is designed to search for semantic correspondences between heterogeneous data sources from different ontologies. The current implementation only supports one-to-one alignment between concepts and properties (including object properties and data properties). The core of CSA is exploiting the advantage of each basic strategy in the alignment process. For example, the string-based similarity measure works well when two entities are linguistically similar, while the structure-based similarity measure is effective when two entities have similar local structures. The system automatically combines multiple similarity measures based on an analysis of their similarity matrices. Details of the system are described in the following parts.

Fig. 1. The main process of CSA

1.2 Specific techniques used

The process of the system is illustrated in Figure 1. First, we calculate five basic similarity measures: string edit distance, WordNet-based, profile, structure, and instance-based similarity. Second, the weight for each similarity is estimated through a weight estimation process. Then, we aggregate these similarities based on their weights. After that, we propagate the similarities to obtain the final similarity matrix. The pre-alignment is then extracted from this matrix. Finally, we apply a pruning process to obtain the final alignment.

Similarity Generation

The similarity between entities in the two ontologies is computed by five basic measures. The string edit distance measures the lexical feature of an entity's name. The WordNet-based measure [3] exploits the similarity between words occurring in an entity's name; we use the method of Wu and Palmer for calculating the WordNet similarity [8]. The profile similarity makes use of the id, label, and comment information contained in an entity. The profile for a class also takes its properties and instances into account; the profile for a property includes its domains and ranges. We then construct weighted feature vectors using tf-idf, and the similarity is calculated as the cosine similarity of the two vectors. The structure similarity is calculated for classes only. This similarity measures the difference in the local structure of an entity. We implement the method introduced in [7] for the structure measure; the calculation is based on the differences in the number of a class's children, the number of its siblings, its normalized depth from the root, and the number of properties restricted to the class. The instance-based measure is similar to the profile measure, except that we only utilize the content of the instances belonging to classes and the properties appearing in those instances.

Weight Estimation

Weight estimation is the core of CSA.
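Before going into the weight estimation step, the tf-idf profile comparison described under Similarity Generation can be sketched as follows. This is a simplified, hypothetical illustration, not the CSA implementation; in particular, the smoothed idf variant is an assumption, chosen so that terms shared by all profiles keep a non-zero weight.

```python
import math
from collections import Counter

def tfidf_vectors(profiles):
    """Turn tokenized entity profiles into tf-idf weighted sparse vectors.

    Uses a smoothed idf (an assumption; the paper does not specify the
    exact tf-idf variant).
    """
    n = len(profiles)
    df = Counter()                      # document frequency of each term
    for profile in profiles:
        for term in set(profile):
            df[term] += 1
    vectors = []
    for profile in profiles:
        tf = Counter(profile)           # term frequency within one profile
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

For example, two entities whose profiles share tokens such as "person" and "name" receive a high cosine score, while profiles with disjoint vocabularies score 0.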
In this step, we analyse the similarity matrix of each basic measure to find which measures are actually effective for the alignment process and which are not. The weight estimation process is based on two observations. First, for each single method, the process of finding a threshold that distinguishes matching pairs from non-matching pairs can be viewed as a binary classification problem [5], in which the positive class contains matching pairs and the negative class contains non-matching ones. Second, in one-to-one ontology alignment, the maximum number of matching pairs equals the minimum of the numbers of entities in the two ontologies. Therefore, if a single method is effective, its corresponding similarity matrix must satisfy two criteria: the matrix must be able to distinguish matching pairs from non-matching pairs, and the number of matching pairs must approximate the minimum of the numbers of entities in the two ontologies. Based on these criteria, we model the weight estimation process for concepts as follows. First, for each similarity matrix we use the K-means algorithm to cluster the similarity values into two classes (k = 2); the feature is the similarity value of each pair of classes. The cluster with the higher mean represents the matching set, and the one with the lower mean represents the non-matching set. We then filter out all values belonging to the non-matching set; what remains is a similarity matrix containing only the higher values. Next, we count the number of rows that still contain a value in the matrix; these rows represent possible matching pairs. Because we consider one-to-one matching, one concept from the source ontology is matched to at most one concept from the target ontology. Finally, the weight is estimated as the ratio of the number of such rows to the number of values in the filtered matrix.
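The steps above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the CSA code: a hand-rolled 1-D K-means with k = 2 separates the matching from the non-matching cluster, and the weight is the ratio from Eq. (1).

```python
def kmeans_1d_two(values, iters=100):
    """Minimal 1-D K-means with k = 2; returns (low_centroid, high_centroid)."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(low) / len(low) if low else lo
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):    # converged
            break
        lo, hi = new_lo, new_hi
    return lo, hi

def estimate_weight(sim_matrix):
    """Weight of one measure: (#rows that keep a value) / (#values in the
    matching cluster), following Eq. (1)."""
    values = [v for row in sim_matrix for v in row]
    lo, hi = kmeans_1d_two(values)
    # keep only values falling in the higher-mean ("matching") cluster
    kept = [[v for v in row if abs(v - hi) < abs(v - lo)] for row in sim_matrix]
    n_matching = sum(len(row) for row in kept)
    n_rows = sum(1 for row in kept if row)
    return n_rows / n_matching if n_matching else 0.0
```

A matrix whose matching cluster contains exactly one value per row (a clean one-to-one signal) gets weight 1.0; a matrix whose matching cluster holds two candidates per row gets weight 0.5, reflecting its weaker discriminative power.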
weight = |rows that have a value| / |values in the matching set|    (1)

The weight estimation for the property similarity matrix is calculated in the same manner.

Similarity Aggregation

The similarity combination is defined as the weighted average of the five basic measures, where the weight for each measure is estimated in the previous step:

Sim_combine(e1, e2) = (Σ_{i=1}^{n} weight_i × Sim_i(e1, e2)) / (Σ_{i=1}^{n} weight_i)    (2)

Similarity Propagation

This step considers the impact of structural information on the similarity between entity pairs in the aggregated matrix. The intuition is that the more similar in structure two entities are, the more similar they are overall. To exploit the structural information, we use the method of Descendant Similarity Inheritance [1].

Extraction

In our system, only one-to-one matching is allowed. The final similarity matrix can be viewed as a bipartite graph in which the first set of vertices consists of entities from the source ontology and the second set consists of entities from the target ontology. Thus, alignment extraction can be modelled as finding a matching in this bipartite graph. To solve this, we apply the stable marriage algorithm [4]. We model the two sets of entities as sets of men and women. For each man and each woman, a priority list over the other set is created based on the similarity values. The stable marriage algorithm is then applied to find a stable matching between the two sets. The result is the pre-alignment.

Table 1. The performance of CSA on the benchmark track

Ontology   Prec.  Rec.
101        1.0    1.0
201-202    0.84   0.71
221-247    0.97   0.99
248-252    0.76   0.63
253-259    0.77   0.57
260-266    0.62   0.51
Average    0.79   0.66

Pruning

This is the final step of our system. In this step we filter out a proportion of entity pairs that have low confidence, in order to increase the precision of the system. The threshold is set manually. The result is the final alignment of the two ontologies.
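The extraction step described above can be sketched with the classic Gale-Shapley stable marriage algorithm, where each side's preference list is derived from the final similarity matrix. This is a hypothetical sketch of the idea, not the CSA implementation.

```python
def stable_alignment(sim):
    """Extract a stable one-to-one alignment from a similarity matrix.

    sim[i][j] is the similarity of source entity i and target entity j.
    Sources play the proposing side; targets accept the proposer with
    the highest similarity seen so far (Gale-Shapley).
    Returns a list of (source_index, target_index) pairs.
    """
    n_src, n_tgt = len(sim), len(sim[0])
    # each source ranks all targets by descending similarity
    prefs = [sorted(range(n_tgt), key=lambda j, i=i: -sim[i][j])
             for i in range(n_src)]
    next_prop = [0] * n_src          # next target each source will propose to
    engaged_to = [None] * n_tgt      # target j -> its current source partner
    free = list(range(n_src))
    while free:
        i = free.pop()
        if next_prop[i] >= n_tgt:
            continue                 # i has been rejected by every target
        j = prefs[i][next_prop[i]]
        next_prop[i] += 1
        cur = engaged_to[j]
        if cur is None:
            engaged_to[j] = i
        elif sim[cur][j] < sim[i][j]:    # target prefers the new proposer
            engaged_to[j] = i
            free.append(cur)
        else:
            free.append(i)               # rejected; try the next target
    return [(i, j) for j, i in enumerate(engaged_to) if i is not None]
```

Note that a greedy per-row argmax could assign two sources to the same target; the stable matching avoids such conflicts while respecting the similarity-based preferences on both sides.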
1.3 Adaptations made for the evaluation

We did not make any specific adaptation for the OAEI 2011 campaign. The three tracks were run with the same set of parameters.

1.4 Link to the system and parameters file

The CSA system can be downloaded from the SEALS project at http://www.seals-project.eu/.

1.5 Link to the set of provided alignments (in align format)

The results of the CSA system can be downloaded from the SEALS project at http://www.seals-project.eu/.


Similar resources

Aggregation of similarity measures in ontology matching

This paper presents an aggregation approach of similarity measures for ontology matching called n-Harmony. The n-Harmony measure identifies top-n highest values in each similarity matrix to assign a weight to the corresponding similarity measure for aggregation. We can also exclude noisy similarity measures that have a low weight and the n-Harmony outperforms previous methods in our experimenta...


A Dynamic Multistrategy Ontology Alignment Framework Based on Semantic Relationship using WordNet

Ontology matching has emerged as a crucial step when information sources are being integrated. Hence, ontology matching has attracted considerable attention in both academia and industry. Clearly, as information sources grow rapidly, manual ontology matching becomes tedious, time-consuming and leads to errors and frustration. Thus the need for automated and semi-automated approaches becomes inc...


Structural Weights in Ontology Matching

Ontology matching finds correspondences between similar entities of different ontologies. Two ontologies may be similar in some aspects such as structure, semantic etc. Most ontology matching systems integrate multiple matchers to extract all the similarities that two ontologies may have. Thus, we face a major problem to aggregate different similarities. Some matching systems use experimental w...


SemMatcher: A Tool for Matching Ontology-based Schemas

In Peer Data Management Systems (PDMS), each peer is an autonomous source that makes available a local schema. Information exchange occurs through the establishment of schema mappings between local schemas. To help matters, ontologies have been considered as uniform representation of local schemas (i.e., peer ontologies). Consequently, ontology matching techniques have been used to determine sc...


Semantic Ontology Method of Learning Resource based on the Approximate Subgraph Isomorphism

Digital learning resource ontology is often based on different specification building. It is hard to find resources by linguistic ontology matching method. The existing structural matching method fails to solve the problem of calculation of structural similarity well. For the heterogeneity problem among learning resource ontology, an algorithm is presented based on subgraph approximate isomorph...



Publication date: 2011